List of projects: 2.1 Web crawler for collecting Hebrew and bilingual corpora
Abstract
The Lab offers a number of practical projects in Natural Language Processing, mostly geared towards processing Hebrew. Some projects require prior knowledge of computational linguistics; others assume no background. All projects involve programming: the end result is a relatively large-scale, well-documented and efficient software package. Some of the projects may also involve research (e.g., reading a research paper and implementing its ideas).
Similar resources
A modular open-source focused crawler for mining monolingual and bilingual corpora from the web
This paper discusses a modular, open-source focused crawler (ILSP-FC) for the automatic acquisition of domain-specific monolingual and bilingual corpora from the Web. Besides describing the main modules integrated in the crawler (dealing with page fetching, normalization, cleaning, text classification, de-duplication and document pair detection), we evaluate several of the system functionalit...
from Corpus Compilation to Bilingual Terminologies for MT and CAT Tools
This paper describes the TTC Web platform, an online demonstrator to show the whole pipeline to compile bilingual terminologies out of comparable corpora gathered from the web using the tools developed in the TTC project Terminology Extraction, Translation Tools and Comparable Corpora. We present the whole chain which has been integrated into the platform, as well as their main components: a fo...
Language Specific and Topic Focused Web Crawling
We describe an experiment on collecting large language and topic specific corpora automatically by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the cr...
Resources for Processing Hebrew
We describe work in progress whose main objective is to create a collection of resources and tools for processing Hebrew. These resources include corpora of written texts, some of them annotated in various degrees of detail; tools for collecting, expanding and maintaining corpora; tools for annotation; lexicons, both monolingual and bilingual; a rule-based, linguistically motivated morphologica...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task, and an unfocused approach often produces undesired results. Several new ideas have therefore been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
Publication date: 2004